See quick reference at the bottom
See full module reference section for full details
At the beginning of each analysis, the first step is to load ReproPhylo and its dependencies with the command:
In [1]:
from reprophylo import *
Once this is done we can start a Project. A Project contains all the data, metadata, methods and environment information, and it is the unit that is saved as a pickle file, which is version controlled with Git.
Although ReproPhylo is designed to record versions and update the pickle file automatically, we will opt out of this behaviour in this tutorial. It will be introduced after we have covered the basics.
Instead, we will manually save a pickle file at the end of each section, and will load it in the next one. You should use the same pickle file name at the end of all the sections. The new content will be added to the one already present in the file.
If you want to jump ahead, there are presaved pickle files (Tutorial_files/basic/outputs), numbered according to the section after which they were saved. For example, outputs/3.6.alignments.pkpj was saved at the end of section 3.6 and can be loaded at the top of section 3.7, instead of your own file.
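To make the save-then-reload workflow concrete, here is a minimal sketch that uses only Python's standard pickle module on a plain dictionary. It does not use ReproPhylo: the dictionary, the helper names save_state and load_state, and the file name are all made up for illustration.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Write the whole object to a pickle file, replacing any older copy."""
    with open(path, 'wb') as handle:
        pickle.dump(state, handle)


def load_state(path):
    """Read the object back from a pickle file."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a project: a plain dict, not a ReproPhylo Project.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'my_project.pkl')

# End of one section: save.
save_state({'loci': ['MT-CO1']}, path)

# Start of the next section: load, add new content, save under the same name.
state = load_state(path)
state['alignments'] = ['MT-CO1 alignment']
save_state(state, path)

reloaded = load_state(path)
```

Because the same file name is reused, the file always holds the latest state, including everything saved in earlier sections.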
To start a Project, we have to specify the loci to analyse (not actual sequence data, only some information on the loci) and a pickle file name.
A Locus can be described manually using a command or by providing a file. For each Locus, we have to specify the character type (DNA or protein), the feature type (e.g., rRNA, CDS or gene), the name of the locus (e.g., MT-CO1) and other possible aliases, which may come in handy if we want to read a genbank file (e.g., cox1, coi).
Describe loci using a command
In [2]:
coi = Locus(char_type='dna',
            feature_type='CDS',
            name='MT-CO1',
            aliases=['cox1', 'coi'])
This is a single Locus description (a Locus object). We can confirm its content by printing it like this:
In [3]:
print coi
Describing loci using a file
Another way of describing loci is to write them in a file. The file has one line for each Locus, where each line has at least four items, separated by commas. The items, as above, are the character type, the feature type, the name of the locus and other possible aliases. At least one alias must be specified, but it can be identical to the name. For the MT-CO1 Locus, a file would look like this:
dna,CDS,MT-CO1,cox1,coi
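The line layout described above can be sketched in plain Python. The function parse_loci_line below is a hypothetical helper written for this illustration, not part of ReproPhylo. It only shows how the four kinds of items map onto one line, and that a line with fewer than four items is incomplete.

```python
def parse_loci_line(line):
    """Split one loci-file line into its parts.

    Hypothetical helper for illustration only; not part of ReproPhylo.
    """
    fields = [field.strip() for field in line.strip().split(',')]
    if len(fields) < 4:
        raise ValueError(
            'expected at least four comma-separated items: %r' % line)
    char_type, feature_type, name = fields[:3]
    aliases = fields[3:]  # everything after the name is an alias
    return {'char_type': char_type,
            'feature_type': feature_type,
            'name': name,
            'aliases': aliases}


record = parse_loci_line('dna,CDS,MT-CO1,cox1,coi')
```

Here record holds the character type 'dna', the feature type 'CDS', the name 'MT-CO1' and the two aliases 'cox1' and 'coi'.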
Deducing a loci file from a genbank file
A third way of describing loci is to run a command that guesses them from a genbank file and writes them into a comma-delimited file, as above. This file can be used as is, or it can be edited. The following command will prepare such a loci file from a genbank file containing all the GenBank records belonging to the sponge family Tetillidae. Text starting with a hash (#) is a comment, which does not affect the command:
In [4]:
list_loci_in_genbank('data/Tetillidae.gb',       # The input genbank file
                     'data/loci.csv',            # The loci file
                     'outputs/loci_counts.txt')  # Additional output, discussed below
The command generated the loci file and wrote it in data/loci.csv. Here are some excerpts separated by three dots:
dna,rRNA,18s,18S ribosomal RNA,18S rRNA
dna,rRNA,28s,28S large subunit ribosomal RNA,28S ribosomal RNA
...
dna,CDS,MT-ATP8,atp8,ATP8
dna,CDS,MT-CO1,coi,COI,cox1,COX1,coxI
...
dna,rRNA,rnl,rnl
dna,rRNA,rns,rns
dna,rRNA,rrnL,rrnL
Each line represents a locus that was found in the genbank file data/Tetillidae.gb. For some genes, such as 18s, synonyms were recognized and placed as aliases in one line. In other cases, such as for rnl and rrnL, they were not.
Editing the loci file
Possible edits to this file include:
Marking lines as synonymous by appending the same integer to each of them. For example,
dna,rRNA,rnl,rnl
dna,rRNA,rrnL,rrnL
will become
dna,rRNA,rnl,rnl,9
dna,rRNA,rrnL,rrnL,9
Which integer is written is unimportant, as long as it is shared between synonymous lines.
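One way to picture what a shared trailing integer expresses is the following plain-Python sketch. It is an assumption about the intent (lines carrying the same integer describe one locus), not ReproPhylo's own code, and merge_synonymous_lines is a hypothetical name. It also assumes that no real alias consists of digits only.

```python
def merge_synonymous_lines(lines):
    """Combine loci lines that end in the same integer into one entry each.

    Illustrative only; an assumption about the intent, not ReproPhylo code.
    """
    merged = []      # entries in order of first appearance
    by_marker = {}   # integer marker -> entry already created for it
    for line in lines:
        fields = [f.strip() for f in line.split(',') if f.strip()]
        if not fields:
            continue  # skip blank lines
        # A trailing all-digit item is treated as the synonym marker.
        marker = fields.pop() if fields[-1].isdigit() else None
        char_type, feature_type, name = fields[:3]
        aliases = fields[3:]
        if marker is not None and marker in by_marker:
            # Same marker seen before: fold this line into the first entry.
            entry = by_marker[marker]
            for candidate in [name] + aliases:
                if candidate not in entry['aliases']:
                    entry['aliases'].append(candidate)
            continue
        entry = {'char_type': char_type,
                 'feature_type': feature_type,
                 'name': name,
                 'aliases': aliases}
        merged.append(entry)
        if marker is not None:
            by_marker[marker] = entry
    return merged


merged = merge_synonymous_lines(['dna,rRNA,rnl,rnl,9',
                                 'dna,rRNA,rns,rns',
                                 'dna,rRNA,rrnL,rrnL,9'])
```

In this sketch the first line carrying a given integer supplies the name, and the names and aliases of later lines with the same integer become additional aliases. The unmarked rns line stays a separate entry.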
Changing the character type from dna to prot, as such:
prot,CDS,MT-CO1,coi,COI,cox1,COX1,coxI
This will tell the program to use protein sequences instead of DNA sequences. The sequence alignment tutorial explains how to use both the protein and the DNA sequences of the same locus to conduct codon alignment.
The second file that the command above produced, outputs/loci_counts.txt, contains a list of the loci found in the genbank file, with the number of their occurrences. This can be used as a guide when deciding which loci to delete and which to keep.
Project
Loading Locus objects
First we'll make another Locus object to make the point that more than one can be read:
In [5]:
ssu = Locus('dna','rRNA','18S',['ssu','SSU-rRNA'])
Regardless of whether we have one or more Locus objects, they are read as a list, which means that they are wrapped in square brackets and separated by commas:
In [6]:
loci_list = [coi, ssu]
This command will start the Project and will write it to the pickle file outputs/dummy.pkpj:
pj = Project(loci_list, pickle='outputs/dummy.pkpj')
The following alternative will start a Project and will load the loci from a file, data/edited_loci.csv, that looks like this:
dna,rRNA,18s,18S ribosomal RNA,18S rRNA
dna,rRNA,28s,28S large subunit ribosomal RNA
dna,CDS,MT-CO1,coi,COI,cox1,COX1,coxI
In [7]:
pj = Project('data/edited_loci.csv',
             pickle='outputs/my_project.pkpj', git=False)
This will provoke a number of Git-related messages, which will be discussed in the version control section of this tutorial.
If we print the Project we'll get this message:
In [8]:
print pj
Project
As you have seen, when you start a Project you pass a list of loci or a csv file name with the loci attributes:
pj = Project(loci_list, pickle='filename')
Once the Project exists, it is possible to modify the Locus objects it contains. To add a Locus, you need to create it, as you have done:
lsu = Locus('dna', 'rRNA', '28S', ['28s','LSU-rRNA'])
and then also add it to the Project. Loci are stored in a list called pj.loci, so the new Locus can be appended to it:
pj.loci.append(lsu)
or if we have a list of new loci to add, for example:
new_loci_list = [nd5, lsu]
it can be added to the loci list like so:
pj.loci += new_loci_list
Lastly, we can modify loci that are already in pj.loci. For example, change the name and add an alias to the MT-CO1 Locus object:
for l in pj.loci:
    # Find the Locus named MT-CO1
    if l.name == 'MT-CO1':
        l.name = 'COI'           # Rename it to COI
        l.aliases.append('coi')  # Add the alias coi
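The same find-and-modify pattern can be wrapped in a small lookup function that also matches aliases. The sketch below runs without ReproPhylo: SimpleLocus is a hypothetical stand-in that only mimics the attribute names used in this tutorial (char_type, feature_type, name, aliases), and find_locus is a made-up helper, not a ReproPhylo function.

```python
class SimpleLocus(object):
    """Hypothetical stand-in with the attribute names used in this tutorial."""

    def __init__(self, char_type, feature_type, name, aliases):
        self.char_type = char_type
        self.feature_type = feature_type
        self.name = name
        self.aliases = aliases


def find_locus(loci, label):
    """Return the first locus whose name or aliases include label, else None."""
    for locus in loci:
        if label == locus.name or label in locus.aliases:
            return locus
    return None


loci = [SimpleLocus('dna', 'CDS', 'MT-CO1', ['cox1']),
        SimpleLocus('dna', 'rRNA', '18S', ['ssu', 'SSU-rRNA'])]

# Look the locus up by one of its aliases, then rename it.
target = find_locus(loci, 'cox1')
if target is not None:
    target.name = 'COI'
    if 'coi' not in target.aliases:  # avoid adding a duplicate alias
        target.aliases.append('coi')
```

Checking for the alias before appending keeps the list free of duplicates if the cell is run more than once.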
In [11]:
# Update the pickle file
pickle_pj(pj, 'outputs/my_project.pkpj')
In [ ]:
# A Locus object
coi = Locus(char_type='dna',          # or 'prot'
            feature_type='CDS',       # any string
            name='MT-CO1',            # any string
            aliases=['coi', 'cox1'])  # list of strings
# Guess loci.csv file from a genbank file
list_loci_in_genbank('genbank.gb',
                     'loci.csv',
                     'loci_counts.txt')
# Start a Project
# With a Locus object list
pj = Project([coi, ssu], pickle='pickle_filename')
# With a loci.csv file
pj = Project('loci.csv', pickle='pickle_filename')
# Add a Locus to an existing Project
pj.loci.append(coi)
#Or
pj.loci += [coi]
# Modify a Locus existing in a Project
for l in pj.loci:
    if l.name == 'MT-CO1':
        l.name = 'newName'
        l.feature_type = 'newFeatureType'
        l.char_type = 'prot'
        l.aliases.append('newAlias')
        # Or
        l.aliases += ['newAlias1', 'newAlias2']